Description

[0001] The invention relates to VLIW (Very Long Instruction Word) processors and in particular to instruction formats 
for such processors and an apparatus and method for processing such instruction formats. 
[0002] The invention relates in particular to a VLIW processor according to the part of Claim 1 preceding the words "characterized in that".

[0003] A scheme for compression of VLIW instructions has been proposed in US Pat. Nos. 5,179,680 and 5,057,837. This compression scheme provides for an instruction word from which unused operations are eliminated and a mask word indicating which operations have been eliminated.

[0004] VLIW processors have instruction words including a plurality of issue slots. The processors also include a plurality of functional units. Each functional unit executes a set of operations of a given type. Each functional unit is RISC-like in that it can begin an instruction in each machine cycle in a pipelined manner. Each issue slot holds a respective operation. All of the operations in the same instruction word are to be begun in parallel on the functional units in a single cycle of the processor. Thus the VLIW implements fine-grained parallelism.

[0005] Thus, typically an instruction on a VLIW machine includes a plurality of operations. On conventional machines, each operation might be referred to as a separate instruction. However, in the VLIW machine, each instruction is composed of operations or no-ops (dummy operations).

[0006] Like conventional processors, VLIW processors use a memory device, such as a disk drive, to store instruction streams for execution on the processor. A VLIW processor can also use caches, like conventional processors, to store pieces of the instruction streams with high-bandwidth accessibility to the processor.

[0007] The instruction in the VLIW machine is built up by a programmer or compiler out of these operations. Thus 
the scheduling in the VLIW processor is software-controlled. 

[0008] The VLIW processor can be compared with other types of parallel processors, such as vector processors and superscalar processors, as follows. Vector processors have single operations which are performed on multiple data items simultaneously. Superscalar processors implement fine-grained parallelism, like the VLIW processors, but unlike the VLIW processor, the superscalar processor schedules operations in hardware.

[0009] Because of the long instruction words, the VLIW processor has aggravated problems with cache use. In 
particular, large code size causes cache misses, i.e. situations where needed instructions are not in cache. Large code 
size also requires a higher main memory bandwidth to transfer code from the main memory to the cache. 
[0010] Large code size can be aggravated by the following factors:

- In order to fine-tune programs for optimal running, techniques such as grafting, loop unrolling, and procedure inlining are used. These procedures increase code size.
- Not all issue slots are used in each instruction. A good optimizing compiler can reduce the number of unused issue slots; however, a certain number of no-ops (dummy operations) will continue to be present in most instruction streams.
- In order to optimize use of the functional units, operations on conditional branches are typically begun prior to expiration of the branch delay, i.e. before it is known which branch is going to be taken. To resolve which results are actually to be used, guard bits are included with the instructions.
- Larger register files, preferably used on newer processor types, require longer addresses, which have to be included with operations.

[0011] A scheme for compression of VLIW instructions has been proposed in US Pat. Nos. 5,179,680 and 5,057,837. This compression scheme eliminates unused operations in an instruction word using a mask word, but there is room to compress the instruction further.

[0012] Further information about the technical background to this application can be found in the following prior applications:

- US Application Ser. No. 998,090, filed December 29, 1992 (PHA 21,777), which shows a VLIW processor architecture for implementing fine-grained parallelism;
- US Application Ser. No. 142,648, filed October 25, 1993 (PHA 1205) (US-A-5,450,556), which shows use of guard bits; and
- US Application Ser. No. 366,958, filed December 30, 1994 (PHA 21,932) (EP-A-0 748 477), which shows a register file for use with VLIW architecture.


Bibliography of program compression techniques:

- J. Wang et al., "The Feasibility of Using Compression to Increase Memory System Performance", Proc. 2nd Int. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, pp. 107-113 (Durham, NC, USA, 1994);
- H. Schroder et al., "Program compression on the instruction systolic array", Parallel Computing, vol. 17, no. 2-3, June 1991, pp. 207-219;
- A. Wolfe et al., "Executing Compressed Programs on an Embedded RISC Architecture", J. Computer and Software Engineering, vol. 2, no. 3, pp. 315-327 (1994);
- M. Kozuch et al., "Compression of Embedded Systems Programs", Proc. 1994 IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors (Oct. 10-12, 1994, Cambridge, MA, USA), pp. 270-277.

Typically, the approach adopted in these documents has been to attempt to compress a program as a whole or in blocks of program code. Moreover, these approaches typically necessitate some table of instruction locations or of locations of blocks of instructions.

[0013] It is an object of the invention to reduce code size in a VLIW processor. 

[0014] It is another object of the invention to create a VLIW processor which processes more highly compressed instructions.

[0015] The processor according to the invention is characterized in that the decompression unit is arranged to decompress operations with respective compressed operation lengths chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths. The set of available operation lengths is, for example, 0, 26, 34 and 42 bits. Which operations are compressed to a particular length depends first of all on a study of the frequency of occurrence of the operations, which could vary depending on the type of software written. Furthermore, the length may be made dependent on whether the operation is guarded or unguarded, whether it produces a result, whether it uses an immediate parameter, and on the number of operands it uses.

[0016] The processor according to the invention has an embodiment wherein the decompression unit is arranged to take a format field from the compressed instruction medium, the format field specifying the respective compressed operation length for each operation of the compressed instruction, the decompression unit decompressing the operations of the compressed instruction according to the format field. Preferably, the format field also specifies which issue slots of the processor are to be used by the instruction.

[0017] In a further embodiment of the processor according to the invention, the format field has N sub-fields, N being the number of issue slots, each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits. When four different operation lengths are used, the sub-field may be, for example, 2 bits long.
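By way of illustration only, the following Python sketch packs one 2-bit sub-field per issue slot into a format field. It borrows the bit assignments of Table I in the detailed description (26, 34 and 42 bits, plus a code for an unused slot); the function names and the use of `None` to mark an unused slot are hypothetical choices made for this sketch, not part of the specification.

```python
# Per-slot sub-field (Format[2i], Format[2i+1]) = (lsb, msb), taken from Table I.
# None marks an unused issue slot (a convention chosen for this sketch).
SIZE_TO_SUBFIELD = {26: (0, 0), 34: (1, 0), 42: (0, 1), None: (1, 1)}


def encode_format(slot_sizes):
    """Return Format[0..2N-1] for a list of per-slot operation sizes."""
    bits = []
    for size in slot_sizes:
        if size not in SIZE_TO_SUBFIELD:
            raise ValueError(f"unsupported operation size: {size!r}")
        bits.extend(SIZE_TO_SUBFIELD[size])
    return bits


def format_as_int(bits):
    """Pack Format[j] little endian, i.e. bit j has weight 2**j."""
    return sum(bit << j for j, bit in enumerate(bits))


# Two 34-bit operations for issue slots 0 and 3 of a five-slot machine.
fmt = encode_format([34, None, None, 34, None])
print(fmt, format_as_int(fmt))  # -> [1, 0, 1, 1, 1, 1, 1, 0, 1, 1] 893
```

The resulting bit list is the same as the first Format example given in paragraph [0063].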

[0018] In another embodiment of the invention, the decompression unit is arranged for:

- taking a preceding compressed instruction from the compressed instruction medium together with the format field;
- starting decompression of the preceding compressed instruction; and subsequently
- taking the compressed instruction from the compressed instruction memory and starting decompression of the compressed instruction according to the format field taken from the compressed instruction medium together with the preceding compressed instruction.

Thus the format field is available before the compressed instruction is loaded, and preparations for decompression according to the format field can start before the compressed instruction is loaded.



[0019] In an embodiment of the invention, the decompression unit takes the format field from the compressed instruction medium in a memory access unit, the memory access unit also comprising at least one operation part sub-field, the decompression unit integrating the operation part sub-field in at least one of the operations of the decompressed instruction. This increases retrieval efficiency and allows pipelining of instruction retrieval. This may be used in a processor capable of decompressing any number of available operation lengths, including only two lengths (e.g. 0 and 32). Thus, the decompression unit is alerted to how many issue slots to expect in the instruction to follow prior to retrieval of that instruction. For each instruction other than branch targets, a field specifying a format may be stored with a previous instruction.

[0020] The invention also relates to a method of producing compressed code for running on a VLIW processor according to Claim 7. This method generates instructions useful for the processor.

[0021] The method according to the invention has an embodiment in which the method is applied to a stream of instructions including said instruction, the method comprising the step of determining for each instruction whether that instruction is a branch target of a branch from another instruction of the stream of instructions, and compressing only those instructions which are not branch targets. Thus branch targets do not need to be decompressed during execution of a program by the processor, and no decompression delay is incurred after execution of a branch instruction.
BRIEF DESCRIPTION OF THE DRAWING

[0022] The invention will now be described by way of non-limitative example with reference to the following figures:

Fig. 1a shows a processor for using the compressed instruction format of the invention.

Fig. 1b shows more detail of the CPU of the processor of Fig. 1a.

Figs. 2a-2e show possible positions of instructions in cache.

Fig. 3 illustrates a part of the compression scheme in accordance with the invention.

Figs. 4a-4f illustrate examples of compressed instructions in accordance with the invention.

Figs. 5a-5b give a table of compressed instruction formats according to the invention.

Fig. 6a is a schematic showing the functioning of instruction cache 103 on input.

Fig. 6b is a schematic showing the functioning of a portion of the instruction cache 103 on output.

Fig. 7 is a schematic showing the functioning of instruction cache 104 on output.

Fig. 8 illustrates compilation and linking of code according to the invention.

Fig. 9 is a flow chart of compression and shuffling modules.

Fig. 10 expands box 902 of Fig. 9.

Fig. 11 expands box 1005 of Fig. 10.

Fig. 12 illustrates the decompression process.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0023] Fig. 1a shows the general structure of a processor according to the invention. A microprocessor according to the invention includes a CPU 102, an instruction cache 103, and a data cache 105. The CPU is connected to the caches by high-bandwidth buses. The microprocessor also contains a memory 104 where an instruction stream is stored.

[0024] The cache 103 is structured to have 512-bit double words. The individual bytes in the words are addressable, but the bits are not. Bytes are 8 bits long. Preferably, the double words are accessible as a single word in a single clock cycle.

[0025] The instruction stream is stored as instructions in a compressed format in accordance with the invention. The compressed format is used both in the memory 104 and in the cache 103.

[0026] Fig. 1b shows more detail of the VLIW processor according to the invention. The processor includes a multiport register file 150, a number of functional units 151, 152, 153, ..., and an instruction issue register 154. The multiport register file stores results from and operands for the functional units. The instruction issue register includes a plurality of issue slots for containing operations to be commenced in a single clock cycle, in parallel, on the functional units 151, 152, 153, .... A decompression unit 155, explained more fully below, converts the compressed instructions from the instruction cache 103 into a form usable by the instruction issue register 154.

COMPRESSED INSTRUCTION FORMAT 

1. General Characteristics

[0027] The preferred embodiment of the claimed instruction format is optimized for use in a VLIW machine having an instruction word which contains 5 issue slots. The format has the following characteristics:

- unaligned, variable length instructions;
- variable number of operations per instruction;
- 3 possible sizes of operations: 26, 34 or 42 bits (also called a 26/34/42 format);
- the 32 most frequently used operations are encoded more compactly than the other operations;
- operations can be guarded or unguarded;
- operations are one of zeroary, unary, or binary, i.e. they have 0, 1 or 2 operands;
- operations can be resultless;
- operations can contain immediate parameters having 7 or 32 bits;
- branch targets are not compressed; and
- format bits for an instruction are located in the prior instruction.

2. Instruction Alignment 

[0028] Except for branch targets, instructions are stored aligned on byte boundaries in cache and main memory. Instructions are unaligned with respect to word or block boundaries in either cache or main memory. Unaligned instruction cache access is therefore needed.

[0029] In order to retrieve unaligned instructions, the processor retrieves one word per clock cycle from the cache.

[0030] As will be seen from the compression format described below, branch targets need to be uncompressed and must fall within a single word of the cache, so that they can be retrieved in a single clock cycle. Branch targets are aligned by the compiler or programmer according to the following rule:

- if a word boundary falls within the branch target or exactly at the end of the branch target, padding is added to make the branch target start at the next word boundary.

Because the preferred cache retrieves double words in a single clock cycle, the rule above can be modified to substitute double word boundaries for word boundaries.
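The alignment rule of [0030] reduces to a small computation. The following Python sketch is illustrative only: the function name is hypothetical, and the default word size of 32 bytes is an assumption derived from the 512-bit double words of [0024] (a double word of 64 bytes, hence a word of 32 bytes). The example length of 28 bytes corresponds to an uncompressed five-slot instruction as given in [0075].

```python
def padding_before_branch_target(address, length, word_size=32):
    """Padding bytes to insert before a branch target starting at `address`.

    A word boundary is a problem if it lies strictly inside the target or
    exactly at its end; in either case the target moves to the next boundary.
    """
    if not 0 < length < word_size:
        # Even a word-aligned target of a full word would end on a boundary.
        raise ValueError("a branch target must be shorter than one word")
    first_word = address // word_size
    # (address + length) is the first byte after the target, so a boundary
    # exactly at the end of the target also changes this word index.
    end_word = (address + length) // word_size
    if end_word == first_word:
        return 0
    return word_size - address % word_size


print(padding_before_branch_target(10, 28))                 # -> 22
print(padding_before_branch_target(4, 28))                  # -> 28 (ends on a boundary)
print(padding_before_branch_target(10, 28, word_size=64))   # -> 0 (double-word rule)
```

Passing `word_size=64` applies the double-word variant of the rule mentioned at the end of [0030].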

[0031] The normal unaligned instructions are retrieved so that succeeding instructions are assembled from the tail 
portion of the current word and an initial portion of the succeeding word. Similarly, all subsequent instructions may be 
assembled from 2 cache words, retrieving an additional word in each clock cycle. 
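The two-word assembly described in [0031] can be modelled in Python as follows. This is a software sketch only, under the assumptions that all cache words have the same size and that no instruction is longer than one word; the function name and the toy 8-byte words are hypothetical.

```python
def fetch_unaligned(words, address, length):
    """Assemble `length` bytes starting at byte `address` from cache words."""
    word_size = len(words[0])
    if length > word_size:
        raise ValueError("instruction longer than one cache word")
    index, offset = divmod(address, word_size)
    chunk = words[index]
    if offset + length > word_size:
        # Tail of the current word plus the initial portion of the next word.
        chunk += words[index + 1]
    return chunk[offset:offset + length]


# Three 8-byte toy words holding the byte values 0..23.
words = [bytes(range(i, i + 8)) for i in (0, 8, 16)]
print(list(fetch_unaligned(words, 6, 4)))  # -> [6, 7, 8, 9]
```

Because `bytes` objects are immutable, the concatenation builds a temporary window and leaves the stored words unchanged, much as the hardware combines two retrieved words without altering the cache contents.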

[0032] This means that whenever code segments are relocated (for instance in the linker or in the loader), alignment must be maintained. This can be achieved by relocating base addresses of the code segments to multiples of the cache block size.

[0033] Figs. 2a-2e show unaligned instruction storage in cache in accordance with the invention.

[0034] Fig. 2a shows two cache words with three instructions i1, i2, and i3 in accordance with the invention. The instructions are unaligned with respect to word boundaries. Instructions i1 and i2 can be branch targets, because they fall entirely within a cache word. Instruction i3 crosses a word boundary and therefore must not be a branch target. For the purposes of these examples, however, it will be assumed that i1, and only i1, is a branch target.
[0035] Fig. 2b shows an impermissible situation. Branch target i1 crosses a word boundary. Accordingly, the compiler 
or programmer must shift the instruction i1 to a word boundary and fill the open area with padding bytes, as shown in 
Fig. 2c. 

[0036] Fig. 2d shows another impermissible situation. Branch target instruction i1 ends precisely at a word boundary. In this situation, again i1 must be moved over to a word boundary and the open area filled with padding, as shown in Fig. 2e.

[0037] Branch targets must be instructions, rather than operations within instructions. The instruction compression techniques described below generally eliminate no-ops (dummy operations). However, because the branch target instructions are uncompressed, they must contain no-ops to fill the issue slots which are not to be used by the processor.

3. Bit and Byte order 

[0038] Throughout this application, bit and byte order are little endian. Bits and bytes are listed with the least significant bits first, as below:

Bit number    0 ... 8 ... 16 ...
Byte number   0     1     2
Address       0     1     2

4. Instruction format 


[0039] The compressed instruction can have up to seven types of fields. These are listed below. The format bits are 
the only mandatory field. 

[0040] The instructions are composed of byte-aligned sections. The first two bytes contain the format bits and the first group of 2-bit operation parts. All of the other fields are integral multiples of a byte, except for the second group of 2-bit operation parts, which contains padding bits.

[0041] As explained above, the operations can have 26, 34, or 42 bits. 26-bit operations are broken up into a 2-bit part to be stored with the format bits and a 24-bit part. 34-bit operations are broken up into a 2-bit part, a 24-bit part, and a one-byte extension. 42-bit operations are broken up into a 2-bit part, a 24-bit part, and a two-byte extension.
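The split described in [0041] can be sketched in Python. This section does not say which bits of an operation go into which part, so the assignment below (bits 0-1 to the 2-bit part, bits 2-25 to the 24-bit part, remaining bits to the extension, all little endian) is purely a hypothetical choice for illustration; the real assignment is defined by the operation encodings of Fig. 5.

```python
EXTENSION_BYTES = {26: 0, 34: 1, 42: 2}


def split_operation(value, size):
    """Split a 26/34/42-bit operation into (2-bit part, 24-bit part, extension).

    Hypothetical bit assignment: bits 0-1 -> 2-bit part, bits 2-25 -> 24-bit
    part, bits 26 and up -> extension bytes, all little endian.
    """
    if size not in EXTENSION_BYTES:
        raise ValueError(f"unsupported operation size: {size}")
    if value < 0 or value >> size:
        raise ValueError("value does not fit in the operation size")
    two_bit = value & 0b11
    part24 = ((value >> 2) & 0xFFFFFF).to_bytes(3, "little")
    extension = (value >> 26).to_bytes(EXTENSION_BYTES[size], "little")
    return two_bit, part24, extension


def join_operation(two_bit, part24, extension):
    """Inverse of split_operation under the same hypothetical assignment."""
    return (two_bit
            | int.from_bytes(part24, "little") << 2
            | int.from_bytes(extension, "little") << 26)


print(split_operation((1 << 42) - 1, 42))  # -> (3, b'\xff\xff\xff', b'\xff\xff')
```

Whatever bit assignment is actually used, the sizes agree with [0041]: 2 + 24 bits for every operation, plus 0, 8 or 16 extension bits.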

A. Format bits

[0042] These are described in section 5 below. With a 5 issue slot machine, 10 format bits are needed. Thus, one 
byte plus two bits are used. 
B. 2-bit operation parts, first group

[0043] While most of each operation is stored in the 24-bit part explained below, i.e. 3 bytes, with the preferred instruction set 24 bits were not adequate: the shortest operations required 26 bits. Accordingly, it was found that the six bits left over in the bytes for the format bit field could advantageously be used to store extra bits from the operations, two bits for each of three operations. If the six bits designated for the 2-bit parts are not needed, they can be filled with padding bits.

C. 24-bit operation parts, first group 

[0044] There are as many 24-bit operation parts as there are 2-bit operation parts in the first group of 2-bit operation parts. In other words, up to three 3-byte operation parts can be stored here.

D. 2 bit operation parts, second group 

[0045] In machines with more than 3 issue slots a second group of 2-bit and 24-bit operation parts is necessary. The 
second group of 2-bit parts consists of a byte with 4 sets of 2-bit parts. If any issue slot is unused, its bit positions are 
filled with padding bits. Padding bits sit on the left side of the byte. In a five issue slot machine, with all slots used, this 
section would contain 4 padding bits followed by two groups of 2-bit parts. The five issue slots are spread out over the 
two groups: 3 issue slots in the first group and 2 issue slots in the second group. 

E. 24-bit operation parts, second group 

[0046] The group of 2-bit parts is followed by a corresponding group of 24 bit operation parts. In a five issue slot 
machine with all slots used, there would be two 24-bit parts in this group. 

F. further groups of 2-bit and 24-bit parts 

[0047] In a very wide machine, i.e. one with more than 6 issue slots, further groups of 2-bit and 24-bit operation parts are necessary.

G. Operation extension 

[0048] At the end of the instruction there is a byte-aligned group of optional 8- or 16-bit operation extensions, each of them byte aligned. The extensions are used to extend the size of the operations from the basic 26 bits to 34 or 42 bits, if needed.

[0049] The formal specification for the instruction format is: 
<instruction> ::=
    <instruction start>
    <instruction middle>
    <instruction end>
    <instruction extension>

<instruction start> ::= <Format:2*N> {<padding:1>}V2 {<2-bit operation part:2>}V1 {<24-bit operation part:24>}V1

<instruction middle> ::= {{<2-bit operation part:2>}4 {<24-bit operation part:24>}4}V3

<instruction end> ::= {<padding:1>}V5 {<2-bit operation part:2>}V4 {<24-bit operation part:24>}V4

<instruction extension> ::= {<operation extension:0/8/16>}S

<padding> ::= "0"

Wherein the variables used above are defined as follows:

N = the number of issue slots of the machine, N > 1
S = the number of issue slots used in this instruction (0 ≤ S ≤ N)

C1 = 4 - (N mod 4)

If (S ≤ C1) then V1 = S and V2 = 2*(C1 - V1)
If (S > C1) then V1 = C1 and V2 = 0
V3 = (S - V1) div 4
V4 = (S - V1) mod 4

If (V4 > 0) then V5 = 2*(4 - V4) else V5 = 0
Explanation of notation

[0050]

- ::= means "is defined as";
- <field name:number> means the field indicated before the colon has the number of bits indicated after the colon;
- {<field name>}number means the field indicated in the angle brackets and braces is repeated the number of times indicated after the braces;
- "0" means the bit "0";
- "div" means integer divide;
- "mod" means modulo;
- :0/8/16 means that the field is 0, 8, or 16 bits long.
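The formal specification can be transcribed into Python to compute instruction lengths, as a cross-check of the examples in Figs. 4a-4f and of the maximum sizes given in [0075]. The sketch is illustrative only; the function names are hypothetical. It reads the two branches of the V1/V2 definition as S <= C1 and S > C1, so that every S from 0 to N is covered (an interpretation assumed here).

```python
EXTENSION_BITS = {26: 0, 34: 8, 42: 16}


def layout_variables(n, s):
    """Return (C1, V1, V2, V3, V4, V5) as defined in the formal specification."""
    if n < 2 or not 0 <= s <= n:
        raise ValueError("requires N > 1 and 0 <= S <= N")
    c1 = 4 - (n % 4)
    if s <= c1:
        v1, v2 = s, 2 * (c1 - s)
    else:
        v1, v2 = c1, 0
    v3 = (s - v1) // 4
    v4 = (s - v1) % 4
    v5 = 2 * (4 - v4) if v4 > 0 else 0
    return c1, v1, v2, v3, v4, v5


def instruction_bits(n, op_sizes):
    """Length in bits of an instruction whose operations have the given sizes."""
    _, v1, v2, v3, v4, v5 = layout_variables(n, len(op_sizes))
    start = 2 * n + v2 + v1 * (2 + 24)   # format bits, padding, first group
    middle = v3 * 4 * (2 + 24)           # full groups of four operations
    end = v5 + v4 * (2 + 24)             # last, partially filled group
    extension = sum(EXTENSION_BITS[size] for size in op_sizes)
    return start + middle + end + extension


print(instruction_bits(5, [26, 42, 26, 34]) // 8)  # Fig. 4e -> 18 bytes
print(instruction_bits(5, [42] * 5) // 8)          # -> 28 bytes
```

For every N and S the start and end sections come out as whole bytes, which is consistent with the byte-aligned sections described in [0040].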


[0051] Examples of compressed instructions are shown in Figs. 4 a-f. 

[0052] Fig. 4a shows an instruction with no operations. The instruction contains two bytes, including 10 bits for the format field and 6 bits which contain only padding. The former is present in all instructions. The latter normally correspond to the 2-bit operation parts. The X's at the top of the bit field indicate that the fields contain padding. In the later figures, an O is used to indicate that the fields are used.

[0053] Fig. 4b shows an instruction with one 26-bit operation. The operation includes one 24 bit part at bytes 3-5 
and one 2 bit part in byte 2. The 2 bits which are used are marked with an O at the top. 

[0054] Fig. 4c shows an instruction with two 26-bit operations. The first 26-bit operation has its 24-bit part in bytes 3-5 and its extra two bits in the last of the 2-bit part fields. The second 26-bit operation has its 24-bit part in bytes 6-8 and its extra two bits in the second to last of the 2-bit part fields.

[0055] Fig. 4d shows an instruction with three 26-bit operations. The 24-bit parts are located in bytes 3-11 and the 
2-bit parts are located in byte 2 in reversed order from the 24-bit parts. 

[0056] Fig. 4e shows an instruction with four operations. The second operation has a 2-byte extension. The fourth operation has a one-byte extension. The 24-bit parts of the operations are stored in bytes 3-11 and 13-15. The 2-bit parts of the first three operations are located in byte 2. The 2-bit part of the fourth operation is located in byte 12. An extension for operation 2 is located in bytes 16-17. An extension for operation 4 is located in byte 18.

[0057] Fig. 4f shows an instruction with 5 operations, each of which has a one-byte extension. The extensions all appear at the end of the instruction.
[0058] While extensions only appear after the second group of 2-bit parts in the examples, they could equally well appear at the end of an instruction with 3 or fewer operations. In such a case, the second group of 2-bit parts would not be needed.

[0059] There is no fixed relationship between the position of operations in the instruction and the issue slot in which they are issued. This makes it possible to make an instruction shorter when not all issue slots are used. Operation positions are filled from left to right. The Format section of the instruction indicates to which issue slot a particular operation belongs. For instance, if an instruction contains only one operation, then it is located in the first operation position and it can be issued to any issue slot, not just slot number 1. The decompression hardware takes care of routing operations to their proper issue slots.
[0060] No padding bytes are allowed between instructions that form one sequential block of code. Padding blocks are allowed between distinct blocks of code.

5. Format Bits 


[0061] The instruction compression technique of the invention is characterized by the use of a format field which specifies which issue slots are to be used by the compressed instruction. To achieve retrieval efficiency, format bits are stored in the instruction preceding the instruction to which the format bits relate. This allows pipelining of instruction retrieval. The decompression unit is alerted to how many issue slots to expect in the instruction to follow prior to retrieval of that instruction. The storage of format bits preceding the operations to which they relate is illustrated in Fig. 3. Instruction 1, which is an uncompressed branch target, contains a format field which indicates the issue slots used by the operations specified in instruction 2. Instructions 2 through 4 are compressed. Each contains a format field which specifies issue slots to be used by the operations of the subsequent instruction.

[0062] The format bits are encoded as follows. There are 2*N format bits for an N-issue-slot machine. In the case of the preferred embodiment, there are five issue slots. Accordingly, there are 10 format bits. Herein the format bits will be referred to in matrix notation as Format[j], where j is the bit number. The format bits are organized in N groups of 2 bits. Bits Format[2i] and Format[2i+1] give format information about issue slot i, where 0 ≤ i < N. The meaning of the format bits is explained in the following table:

TABLE I

Format[2i]   Format[2i+1]   Meaning
(lsb)        (msb)
0            0              Issue slot i is used and an operation for it is available in the instruction. The operation size is 26 bits. The size of the extension is 0 bytes.
1            0              Issue slot i is used and an operation for it is available in the instruction. The operation size is 34 bits. The size of the extension is 1 byte.
0            1              Issue slot i is used and an operation for it is available in the instruction. The operation size is 42 bits. The size of the extension is 2 bytes.
1            1              Issue slot i is unused and no operation for it is included in the instruction.


[0063] Operations correspond to issue slots in left to right order. For instance, if 2 issue slots are used and Format = {1, 0, 1, 1, 1, 1, 1, 0, 1, 1}, then the instruction contains two 34-bit operations. The leftmost is routed to issue slot 0 and the rightmost is routed to issue slot 3. If Format = {1, 1, 1, 1, 1, 0, 1, 0, 1, 0}, then the instruction contains three 34-bit operations: the leftmost is routed to issue slot 2, the second operation is intended for issue slot 3, and the rightmost belongs to issue slot 4.

[0064] The format used to decompress branch target instructions is a constant. Constant_Format = {0, 1, 0, 1, 0, 1, 0, 1, 0, 1} for the preferred five issue slot machine.
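The decoding and routing described in [0059] and [0062]-[0064] can be sketched in Python. The function names and the `"nop"` placeholder are hypothetical; filling unused issue slots with no-ops follows paragraph [0005], which states that each VLIW instruction is composed of operations or no-ops.

```python
# Table I: (Format[2i], Format[2i+1]) -> operation size in bits, None = unused.
SUBFIELD_TO_SIZE = {(0, 0): 26, (1, 0): 34, (0, 1): 42, (1, 1): None}
CONSTANT_FORMAT = [0, 1] * 5  # five-slot branch-target format from [0064]


def decode_format(bits):
    """Return the operation size (or None) for each issue slot."""
    if len(bits) % 2:
        raise ValueError("format field must hold 2 bits per issue slot")
    return [SUBFIELD_TO_SIZE[(bits[2 * i], bits[2 * i + 1])]
            for i in range(len(bits) // 2)]


def route_operations(bits, operations, nop="nop"):
    """Route operations, given in left-to-right order, to their issue slots."""
    sizes = decode_format(bits)
    used_slots = [slot for slot, size in enumerate(sizes) if size is not None]
    if len(used_slots) != len(operations):
        raise ValueError("format field and operation count disagree")
    issue_register = [nop] * len(sizes)  # unused slots receive no-ops
    for slot, operation in zip(used_slots, operations):
        issue_register[slot] = operation
    return issue_register


print(route_operations([1, 1, 1, 1, 1, 0, 1, 0, 1, 0], ["a", "b", "c"]))
# -> ['nop', 'nop', 'a', 'b', 'c']
```

Decoding Constant_Format with this sketch yields a 42-bit size for every slot, matching the statement in [0069] that a 42-bit format is available for all operations in branch targets.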


6. Operation Formats 

[0065] The format of an operation depends on the following properties:

- zeroary, unary, or binary;
- parametric or non-parametric. Parametric instructions contain an immediate operand in the code. Parameters can be of differing sizes. Here there are param7, i.e. seven-bit parameters, and param32, i.e. 32-bit parameters;
- result producing or resultless;
- long or short op code. The short op codes are the 32 most frequent op codes and are five bits long. The long op codes are eight bits long and include all of the op codes, including the ones which can be expressed in a short format. Op codes 0 to 31 are reserved for the 32 short op codes;
- guarded or unguarded. An unguarded instruction has a constant guard value of TRUE;
- latency. A format bit indicates whether operations have a latency equal to one or larger than one;
- signed/unsigned. A format bit indicates, for parametric operations, whether the parameter is signed or unsigned.

[0066] The guarded or unguarded property is determined in the uncompressed instruction format by using the special register file address of the constant 1. If a guard address field contains the address of the constant 1, then the operation is unguarded; otherwise it is guarded. Most operations can occur in both guarded and unguarded formats. An immediate operation, i.e. an operation which transfers a constant to a register, has no guard field and is always unguarded.
[0067] Which op codes are included in the list of 32 short op codes depends on a study of frequency of occurrence 
which could vary depending on the type of software written. 
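A frequency study of the kind mentioned in [0067] can be sketched in Python with `collections.Counter`. The op code names and the trace are hypothetical, and numbering the selected op codes by rank is an arbitrary choice; the document only requires that op codes 0 to 31 be reserved for the 32 short op codes.

```python
from collections import Counter


def choose_short_opcodes(opcode_trace, count=32):
    """Map the `count` most frequent op codes in a trace to short codes 0..count-1."""
    ranked = Counter(opcode_trace).most_common(count)
    return {name: code for code, (name, _) in enumerate(ranked)}


# Hypothetical trace of op code names gathered from compiled programs.
trace = ["iadd"] * 5 + ["ld"] * 3 + ["st"] * 2 + ["jmp"]
print(choose_short_opcodes(trace, count=2))  # -> {'iadd': 0, 'ld': 1}
```

A static count over the program text favours code size, whereas a count over executed instructions would favour cache behaviour; which trace to use is a design choice left open here.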
[0068] Table II below lists the operation formats used by the invention. Unless otherwise stated, all formats are non-parametric, with result, guarded, and with a long op code. To keep the tables and figures as simple as possible, the following table does not list a special form for the latency and signed/unsigned properties. These are indicated with L and S in the format descriptions. For non-parametric, zeroary operations, the unary format is used. In that case the field for the argument is undefined.


TABLE II

OPERATION TYPE                                  SIZE (bits)
<binary-unguarded-short>                        26
<unary-param7-unguarded-short>                  26
<binary-unguarded-param7-resultless-short>      26
<unary-short>                                   26
<binary-short>                                  34
<unary-param7-short>                            34
<binary-param7-resultless-short>                34
<binary-unguarded>                              34
<binary-resultless>                             34
<unary-param7-unguarded>                        34
<unary>                                         34
<binary-param7-resultless>                      42
<binary>                                        42
<unary-param7>                                  42
<zeroary-param32>                               42
<zeroary-param32-resultless>                    42


[0069] For all operations, a 42-bit format is available for use in branch targets. For unary and binary-resultless operations, the <binary> format can be used. In that case, unused fields in the binary format have undefined values. Short 5-bit op codes are converted to long 8-bit op codes by padding the most significant bits with 0's. Unguarded operations get as guard address value the register file address of the constant TRUE. For store operations, the 42-bit <binary-param7-resultless> format is used instead of the regular 34-bit <binary-param7-resultless-short> format (assuming store operations belong to the set of short operations).
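Two of the expansion steps in [0069] can be sketched in Python. The register file address of the constant TRUE is not given in this section, so `TRUE_GUARD_ADDRESS` below is a hypothetical placeholder; the 7-bit range check follows the field sizes of Table III.

```python
TRUE_GUARD_ADDRESS = 1  # hypothetical register file address of the constant TRUE


def expand_opcode(code, is_short):
    """Return the 8-bit long op code as a bit string.

    Padding a 5-bit short code with three leading 0 bits leaves its value
    unchanged, which is why op codes 0 to 31 can serve as the short codes.
    """
    if not 0 <= code < (32 if is_short else 256):
        raise ValueError("op code out of range")
    return format(code, "08b")


def expand_guard(guard_address=None):
    """Unguarded operations (guard_address=None) get the address of TRUE."""
    if guard_address is None:
        return TRUE_GUARD_ADDRESS
    if not 0 <= guard_address < 128:  # 7-bit register file address (Table III)
        raise ValueError("guard address out of range")
    return guard_address


print(expand_opcode(0b10110, is_short=True))  # -> 00010110
```

Because zero-padding does not change the numeric value, the decompressor needs no lookup table for short op codes; only the field width changes.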

[0070] Operation types which do not appear in table II are mapped onto those appearing in table II, according to the 
following table of aliases: 



TABLE II'

FORMAT                                        ALIASED TO
zeroary                                       unary
unary_resultless                              unary
binary_resultless_short                       binary_resultless
zeroary_param32_short                         zeroary_param32
zeroary_param32_resultless_short              zeroary_param32_resultless
zeroary_short                                 unary
unary_resultless_short                        unary
binary_resultless_unguarded                   binary_resultless
unary_unguarded                               unary
binary_param7_resultless_unguarded            binary_param7_resultless
zeroary_unguarded                             unary
unary_resultless_unguarded_short              binary_unguarded_short
unary_unguarded_short                         unary_short
zeroary_param32_unguarded_short               zeroary_param32
zeroary_param32_resultless_unguarded_short    zeroary_param32_resultless
zeroary_unguarded_short                       unary
unary_resultless_unguarded_short              unary
unary_long                                    binary
binary_long                                   binary
binary_resultless_long                        binary
unary_param7_long                             unary_param7
binary_param7_resultless_long                 binary_param7_resultless
zeroary_param32_long                          zeroary_param32
zeroary_param32_resultless_long               zeroary_param32_resultless
zeroary_long                                  binary
unary_resultless_long                         binary
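How the alias table combines with Table II to give an operation its compressed length can be sketched as follows. The sketch rests on stated assumptions: it copies only the Table II rows shown above and a few Table II' aliases, treats the underscore and hyphen spellings of a format name as the same format, and applies the 42-bit branch-target rule of [0069].

```python
# Subset of Table II: format name -> compressed size in bits (rows shown above).
TABLE_II_SIZES = {
    "binary-short": 34,
    "unary-param7-short": 34,
    "binary-param7-resultless-short": 34,
    "binary-unguarded": 34,
    "binary-resultless": 34,
    "unary-param7-unguarded": 34,
    "unary": 34,
    "binary-param7-resultless": 42,
    "binary": 42,
    "unary-param7": 42,
    "zeroary-param32": 42,
    "zeroary-param32-resultless": 42,
}

# Subset of Table II': operation type -> format it is aliased to.
TABLE_II_ALIASES = {
    "zeroary": "unary",
    "unary_resultless": "unary",
    "binary_resultless_short": "binary_resultless",
    "binary_resultless_unguarded": "binary_resultless",
    "unary_long": "binary",
    "binary_long": "binary",
    "unary_param7_long": "unary_param7",
    "zeroary_param32_long": "zeroary_param32",
}

BRANCH_TARGET_SIZE = 42  # [0069]: every operation has a 42-bit branch-target form


def compressed_size(op_type: str, branch_target: bool = False) -> int:
    """Look up the compressed length of an operation type, in bits."""
    if branch_target:
        return BRANCH_TARGET_SIZE
    fmt = TABLE_II_ALIASES.get(op_type, op_type)
    # Table II' spells names with underscores, Table II with hyphens.
    return TABLE_II_SIZES[fmt.replace("_", "-")]


print(compressed_size("zeroary"), compressed_size("binary_long"))  # 34 42
```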



[0071] The following is a table of fields which appear in operations: 

TABLE III

FIELD     SIZE   MEANING
src1      7      register file address of first operand
src2      7      register file address of second operand
guard     7      register file address of guard
dst       7      register file address of result
param     7/32   7-bit parameter or 32-bit immediate value
op code   5/8    5-bit short op code or 8-bit long op code




[0072] Fig. 5 includes a complete specification of the encoding of operations. 

7. Extensions of the instruction format 

[0073] Within the instruction format there is some flexibility to add new operations and operation forms, as long as encoding within a maximum size of 42 bits is possible.

[0074] The format is based on 7-bit register file addresses. For register file addresses of different sizes, the format and the decompression hardware must be redesigned.






EP 0 843 848 B1 

[0075] The format can be used on machines with varying numbers of issue slots. However, the maximum size of an instruction is constrained by the word size in the instruction cache. In a four-issue-slot machine the maximum instruction size is 22 bytes (176 bits), using four 42-bit operations plus 8 format bits. In a five-issue-slot machine the maximum instruction size is 28 bytes (224 bits), using five 42-bit operations plus 10 format bits (220 bits, rounded up to a whole number of bytes).
[0076] In a six-issue-slot machine, the maximum instruction size would be 264 bits, using six 42-bit operations plus 12 format bits. If the word size is limited to 256 bits and six issue slots are desired, the scheduler can be constrained to use at most five operations of the 42-bit format in one instruction. The fixed format for branch targets would then have to use five issue slots of 42 bits and one issue slot of 34 bits (210 + 34 + 12 = 256 bits).
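The size arithmetic of [0075] and [0076] can be checked with a few lines of Python. The sketch simply assumes, as the text does, 42 bits for the largest operation and 2 format bits per issue slot.

```python
OP_BITS = 42                 # largest compressed operation
FORMAT_BITS_PER_SLOT = 2


def max_instruction_bits(slots: int) -> int:
    """Worst-case instruction size: every slot holds a 42-bit operation."""
    return slots * OP_BITS + slots * FORMAT_BITS_PER_SLOT


def bits_to_bytes(bits: int) -> int:
    """Round a bit count up to whole bytes."""
    return -(-bits // 8)


for n in (4, 5, 6):
    bits = max_instruction_bits(n)
    print(n, bits, bits_to_bytes(bits))
# 4 176 22
# 5 220 28
# 6 264 33

# Six slots in a 256-bit cache word: at most five 42-bit operations,
# with the sixth operation limited to 34 bits.
constrained_six = 5 * OP_BITS + 34 + 6 * FORMAT_BITS_PER_SLOT
print(constrained_six)  # 256
```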

COMPRESSING THE INSTRUCTIONS

[0077] Fig. 8 shows how source code becomes a loadable, compressed object module. First, the source code 801 is compiled by compiler 802 to create a first set of object modules 803. These modules are linked by linker 804 to create a second type of object module 805. This module is then compressed and shuffled at 806 to yield loadable module 807.

Any standard compiler or linker can be used. Object modules II contain a number of standard data structures. These include: a header; global and local symbol tables; a reference table for relocation information; a section table; and debug information, some of which are used by the compression and shuffling module 806. The object module II also has partitions, including a text partition, where the instructions to be processed reside, and a source partition, which keeps track of which source files the text came from.

[0078] A high-level flow chart of the compression and shuffling module is shown in Fig. 9. At 901, object module II is read in. At 902, the text partition is processed. At 903, the other sections are processed. At 904, the header is updated. At 905, the object module is output.

[0079] Fig. 10 expands box 902. At 1001, the reference table, i.e. the relocation information, is gathered. At 1002, the branch targets are collected, because these are not to be compressed. At 1003, the software checks whether there are more files in the source partition. If so, at 1004, the portion corresponding to the next file is retrieved. Then, at 1005, that portion is compressed. At 1006, file information in the source partition is updated. At 1007, the local symbol table is updated.

[0080] Once there are no more files in the source partition, the global symbol table is updated at 1008. Then, at 1009, address references in the text section are updated. Then, at 1010, 256-bit shuffling is effected. The motivation for such shuffling will be discussed below.

[0081] Fig. 11 expands box 1005. First, it is determined at 1101 whether there are more instructions to be compressed. If so, a next instruction is retrieved at 1102. Subsequently, each operation in the instruction is compressed at 1103 as per the tables in Figs. 5a and 5b, and a scatter table is updated at 1108. The scatter table is a new data structure, required as a result of compression and shuffling, which will be explained further below. Then, at 1104, all of the operations in an instruction and the format bits of a subsequent instruction are combined as per Figs. 4a-4e. Subsequently, if the current instruction contains an address, the relocation information in the reference table must be updated at 1105. At 1106, information needed to update address references in the text section is gathered. At 1107, the compressed instruction is appended at the end of the output bit string and control is returned to box 1101. When there are no more instructions, control returns to box 1006.
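Step 1104 stores each instruction together with the format bits of the instruction that follows it. A minimal Python sketch of that pairing is given below. The 2-bit codes (0b00 = 26 bits, 0b01 = 34 bits, 0b10 = 42 bits, 0b11 = unused slot) and the placement of slot i in format bits 2i+1..2i are an inference from the Appendix A listing (a slot counts as used unless both of its format bits are 1, the code selects 0, 8 or 16 extension bits, and the fixed branch-target format is 10'b1010101010); the function names are hypothetical and are not taken from comp_scheme.c.

```python
from typing import List, Optional, Tuple

# 2-bit format code per issue slot, as inferred from Appendix A.
FORMAT_CODE = {26: 0b00, 34: 0b01, 42: 0b10, None: 0b11}  # None = unused slot


def format_field(op_sizes: List[Optional[int]]) -> int:
    """Pack per-slot operation sizes (slot 0 first) into a format field."""
    field = 0
    for slot, size in enumerate(op_sizes):
        field |= FORMAT_CODE[size] << (2 * slot)
    return field


def attach_next_formats(
    instructions: List[List[Optional[int]]],
) -> List[Tuple[List[Optional[int]], Optional[int]]]:
    """Pair each instruction with the format field of the instruction after it.

    The last instruction has no successor in this list, so it gets None.
    """
    paired = []
    for idx, instr in enumerate(instructions):
        nxt = instructions[idx + 1] if idx + 1 < len(instructions) else None
        paired.append((instr, format_field(nxt) if nxt is not None else None))
    return paired


print(bin(format_field([42] * 5)))  # 0b1010101010, the fixed branch-target format
```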

[0082] Functions for handling compression are implemented in the various modules as listed below: 



TABLE IV

Name of module                   Identification of function performed
scheme_table                     readable version of table of Figs. 5a and 5b
comp_shuffle.c                   256-bit shuffle, see box 1010
comp_scheme.c                    boxes 1103-1104
comp_bitstring.c                 boxes 1005 & 1009
comp_main.c                      controls main flow of Figs. 9 and 10
comp_src.c, comp_reference.c,    miscellaneous support routines for performing
comp_misc.c, comp_btarget.c      other functions listed in Fig. 11
[0083] The scatter table, which is required as a result of the compression and shuffling of the invention, can be explained as follows.

[0084] The reference table contains a list of locations of addresses used by the instruction stream and a corresponding list of the actual addresses listed at those locations. When the code is compressed, and when it is loaded, those addresses must be updated. Accordingly, the reference table is used at these times to allow the updating.

[0085] However, when the code is compressed and shuffled, the actual bits of the addresses are separated from each other and reordered. Therefore, the scatter table lists, for each address in the reference table, where EACH BIT is located. In the preferred embodiment the table lists a width of a bit field, an offset from the corresponding index of the address in the source text, and a corresponding offset from the corresponding index of the address in the destination text.

[0086] When object module III is loaded to run on the processor, the scatter table allows the addresses listed in the reference table to be updated even before the bits are deshuffled.

[0087] The scatter table is, by way of example, organized as a set of scatter descriptors. Each scatter descriptor contains a set of triples (destination offset, width, source offset) and an unsigned integer indicating the number of triples in the descriptor.

[0088] For example, suppose we have a scatter descriptor with three triples (0,7,3), (7,4,15), (11,5,23), and suppose the source field is at position 320 in the bitstring of the text section. To get the bits of the actual address field, we do the following: bits 0 through 6 (7 bits) of the address field are taken from positions 323 (320+3) through 329 of the bitstring, bits 7 through 10 of the address field are taken from bits 335 through 338 of the bitstring, and bits 11 through 15 of the address field are taken from bits 343 through 347 of the bitstring. Thus the address field has length 16.

[0089] In the object module, a list of reference descriptors is associated with a bitstring. Each reference descriptor refers to a bitfield in the bitstring. Each reference descriptor contains an index into the scatter table where we can find the scatter descriptor that has information about the way the bits of the bitfield are scattered in the bitstring. For example, if a bitfield has position 11 in a bitstring and the scatter descriptor corresponding to the bitfield has a single entry (0,18,0), then the actual source offset is obtained by adding the position and the source offset together: 11 + 0 = 11.
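The scatter-descriptor mechanism of [0084] to [0089] can be sketched in Python as follows: gathering a scattered address, writing it back, and relocating it in place without deshuffling the text. This is a sketch under assumptions: the text section is modelled as a list of bits, a reference descriptor is modelled as a hypothetical (bit position, scatter table index) pair, and wrap-around modulo the field width during relocation is an assumption, not something the text specifies.

```python
from typing import List, Tuple

# A scatter descriptor is a list of (destination offset, width, source offset) triples.
Triple = Tuple[int, int, int]


def gather(bits: List[int], position: int, descriptor: List[Triple]) -> int:
    """Reassemble a scattered bit field; bit dst+n comes from bits[position+src+n]."""
    value = 0
    for dst, width, src in descriptor:
        for n in range(width):
            value |= bits[position + src + n] << (dst + n)
    return value


def scatter(bits: List[int], position: int, descriptor: List[Triple], value: int) -> None:
    """Inverse of gather: write `value` into its scattered bit positions."""
    for dst, width, src in descriptor:
        for n in range(width):
            bits[position + src + n] = (value >> (dst + n)) & 1


def relocate(bits: List[int], position: int, descriptor: List[Triple], delta: int) -> int:
    """Add `delta` to a scattered address in place (wrap-around is assumed)."""
    length = max(dst + width for dst, width, _ in descriptor)
    new = (gather(bits, position, descriptor) + delta) % (1 << length)
    scatter(bits, position, descriptor, new)
    return new


# The descriptor from [0088], referenced by a hypothetical (position, index) pair.
scatter_table = [[(0, 7, 3), (7, 4, 15), (11, 5, 23)]]
ref_position, ref_index = 320, 0

text = [0] * 400
scatter(text, ref_position, scatter_table[ref_index], 0xBEEF)
print(hex(gather(text, ref_position, scatter_table[ref_index])))          # 0xbeef
print(hex(relocate(text, ref_position, scatter_table[ref_index], 0x10)))  # 0xbeff
```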


DECOMPRESSING THE INSTRUCTIONS 

[0090] In order for the VLIW processor to process the instructions compressed as described above, the instructions must be decompressed. After decompression, the instructions will fill the instruction register, which has N issue slots, N being 5 in the case of the preferred embodiment. Fig. 12 is a schematic of the decompression process. Instructions come from memory 1201, i.e. either from the main memory 104 or the instruction cache 105. The instructions must then be deshuffled 1202, which will be explained further below, before being decompressed 1203. After decompression 1203, the instructions can proceed to the CPU 1204.

[0091] Each decompressed operation has 2 format bits plus a 42-bit operation. The 2 format bits indicate one of the four possible operation lengths (unused issue slot, 26-bit, 34-bit, or 42-bit). These format bits have the same values as "Format" in section 5 above. If an operation has a size of 26 or 34 bits, the upper 16 or 8 bits, respectively, are undefined. If an issue slot is unused, as indicated by the format bits, then all operation bits are undefined and the CPU has to replace the op code by a NOP op code (or otherwise indicate NOP to the functional units).
[0092] Formally, the decompressed instruction format is:

<decompressed instruction> ::= {<decompressed operation>}^N
<decompressed operation> ::= <operation:42> <format:2>

[0093] Operations have the format as in Table III (above). 
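A decompressed issue-slot output can be split into its format code and its defined operation bits as sketched below. The bit placement is an assumption taken from the Appendix A listing, which assembles each 44-bit output as {format[1:0], ext[15:0], fix[25:0]} (format code in bits 43..42); the mapping of codes to 26, 34 and 42 bits is likewise inferred from that listing. The function name is hypothetical.

```python
from typing import Optional, Tuple

OP_BITS = 42
# Operation size selected by each 2-bit format code (0b11 = unused slot),
# as inferred from the Appendix A listing.
SIZE_FOR_CODE = {0b00: 26, 0b01: 34, 0b10: 42}


def split_decompressed(word44: int) -> Tuple[int, Optional[int]]:
    """Split a 44-bit decompressed operation into (format code, defined bits).

    For 26- and 34-bit operations the upper 16 or 8 operation bits are
    undefined and are masked off here. An unused slot returns None; the CPU
    then substitutes a NOP.
    """
    code = (word44 >> OP_BITS) & 0b11
    if code == 0b11:
        return code, None
    size = SIZE_FOR_CODE[code]
    return code, word44 & ((1 << size) - 1)


example = (0b01 << 42) | (0xFF << 34) | 0x2ABCDEF01  # 34-bit op, junk above bit 33
print(split_decompressed(example)[1] == 0x2ABCDEF01)  # True
```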

[0094] Appendix A is VERILOG code which specifies the functioning of the decompression unit. VERILOG code is a standard format used as input to the VERILOG simulator produced by Cadence Design Systems, Inc. of San Jose, California. The code can also be input directly to the design compiler made by Synopsys of Mountain View, California, to create circuit diagrams of a decompression unit which will decompress the code. The VERILOG code specifies a list of pins of the decompression unit; these are:



TABLE V

# of pins   name of group of pins                    description of group of pins
512         data512                                  512-bit input data word from memory, i.e. either
                                                     from the instruction cache or the main memory
32          PC                                       input program counter
44          operation4                               output contents of issue slot 4
44          operation3                               output contents of issue slot 3
44          operation2                               output contents of issue slot 2
44          operation1                               output contents of issue slot 1
44          operation0                               output contents of issue slot 0
10          format_out                               output duplicate of format bits in operations
32          first_word                               output first 32 bits pointed to by program counter
1           format_ctrl0                             input indicating whether or not the instruction
                                                     is a branch target
1 each      reissue1, stall_in, freeze, reset, clk   input global pipeline control signals



[0095] Data512 is a double word which contains the instruction that is currently of interest. The program counter PC is used to determine data512 according to the following algorithm:

    A := {PC[31:5], 5'b0}
    if PC[5] = 0 then
        data512' := {M(A), M(A+32)}
    else
        data512' := {M(A+32), M(A)}

where

A is the address of the single word in memory which contains the instruction of interest;
5'b0 means 5 bits which are zeroed out;
M(A) is the word of memory addressed by A;
M(A+32) is the word of memory addressed by A+32;
data512' is the shuffled version of data512.

This means that the words are swapped if an odd word is addressed.
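The word addressing of [0095] and the first alignment stage of the Appendix A listing (swap the two 256-bit halves when PC[5] is set, then shift right by PC[4:0] bytes) can be sketched in Python as follows. The sketch assumes 32-byte memory words and models the 512-bit double word as a Python integer whose low 256 bits are the word containing PC after unswapping; the function names are hypothetical.

```python
WORD_BYTES = 32            # one memory word is 256 bits
MASK256 = (1 << 256) - 1


def word_address(pc: int) -> int:
    """A := {PC[31:5], 5'b0}: the 32-byte word holding the addressed byte."""
    return pc & 0xFFFFFFFF & ~(WORD_BYTES - 1)


def odd_word(pc: int) -> bool:
    """PC[5] = 1 means an odd word is addressed, so the halves are swapped."""
    return (pc >> 5) & 1 == 1


def align(data512: int, pc: int) -> int:
    """Swap the 256-bit halves when PC[5] is set, shift right by
    PC[4:0] bytes and keep the low 256 bits."""
    if odd_word(pc):
        data512 = ((data512 & MASK256) << 256) | (data512 >> 256)
    return (data512 >> (8 * (pc & 0x1F))) & MASK256


print(hex(word_address(0x1234)), odd_word(0x1234))  # 0x1220 True
```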

[0096] Operations are delivered by the decompression unit in a form which is only partially decompressed, because 
the operation fields are not always in the same bit position. Some further processing has to be done to extract the 
operation fields from their bit position, most of which can be done best in the instruction decode stage of the CPU 
pipeline. For every operation field this is explained as follows: 

src1  The src1 field is in a fixed position and can be passed directly to the register file as an address. Only the 32-bit immediate operation does not use the src1 field; in this case the CPU control will not use the src1 operand from the register file.

src2  The src2 field is in a fixed position if it is used and can be passed directly to the register file as an address. If it is not used, it has an undefined value. The CPU control makes sure that a "dummy" src2 value read from the register file is not used.

guard  The guard field is in a fixed position if it is used and can be passed directly to the register file as an address. Simultaneously with register file access, the CPU control inspects the op code and format bits of the operation. If the operation is unguarded, the guard value read from the RF (register file) is replaced by the constant TRUE.

op code  Short or long op code and format bits are available in a fixed position in the operation. They are in bit positions 21-30 plus the 2 format bits. They can be fed directly to the op code decode with maximum time for decoding.

dst  The dst field is needed very quickly in the case of a 32-bit immediate operation with latency 0. This special case is detected quickly by the CPU control by inspecting bit 33 and the format bits. In all other cases there is a full clock cycle available in the instruction decode pipeline stage to decode where the dst field is in the operation (it can be in many places) and extract it.

32-bit immediate  If there is a 32-bit immediate, it is in a fixed position in the operation. The 7 least significant bits are in the src2 field, in the same location as a 7-bit parameter would be.

7-bit parameter  If there is a 7-bit parameter, it is in the src2 field of the operation. There is one exception: the store with offset operation. For this operation, the 7-bit parameter can be in various locations and is multiplexed onto a special 7-bit immediate bus to the data cache.

BIT SWIZZLING 

[0097] Where instructions are long, e.g. 512 bit double words, cache structure becomes complex. It is advantageous 
to swizzle the bits of the instructions in order to simplify the layout of the chip. Herein, the words swizzle and shuffle 
are used to mean the same thing. The following is an algorithm for swizzling bits. 



    for (k = 0; k < 4; k = k + 1)
      for (i = 0; i < 8; i = i + 1)
        for (j = 0; j < 8; j = j + 1)
        begin
          word_shuffled[k*64 + j*8 + i] = word_unshuffled[(4*i + k)*8 + j];
        end

where i, j, and k are integer indices; word_shuffled is an array for storing the bits of a shuffled word; and word_unshuffled is an array for storing the bits of an unshuffled word.
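The swizzle loop above, together with its inverse, can be written in Python as below. Its loop bounds cover 256 bits (4 x 64), matching the 256-bit shuffle of box 1010; the bit-list representation and the function names are illustrative choices, not taken from comp_shuffle.c.

```python
def swizzle256(word_unshuffled):
    """Permute 256 bits: shuffled[k*64 + j*8 + i] = unshuffled[(4*i + k)*8 + j]."""
    if len(word_unshuffled) != 256:
        raise ValueError("expected 256 bits")
    word_shuffled = [0] * 256
    for k in range(4):
        for i in range(8):
            for j in range(8):
                word_shuffled[k * 64 + j * 8 + i] = word_unshuffled[(4 * i + k) * 8 + j]
    return word_shuffled


def deswizzle256(word_shuffled):
    """Inverse permutation: restore the original bit order."""
    if len(word_shuffled) != 256:
        raise ValueError("expected 256 bits")
    word_unshuffled = [0] * 256
    for k in range(4):
        for i in range(8):
            for j in range(8):
                word_unshuffled[(4 * i + k) * 8 + j] = word_shuffled[k * 64 + j * 8 + i]
    return word_unshuffled


# Swizzle bit indices instead of bit values to see where each bit lands.
shuffled = swizzle256(list(range(256)))
print(shuffled[:10])  # [0, 32, 64, 96, 128, 160, 192, 224, 1, 33]
```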

CACHE STRUCTURE 

[0098] Fig. 6a shows the functioning on input of a cache structure which is useful in efficient processing of VLIW instructions. This cache includes 16 banks 601-616 of 2k bytes each. These banks share an input bus 617. The banks are divided into two stacks. The stack on the left will be referred to as "low" and the stack on the right will be referred to as "high".

[0099] The cache can take input in only one bank at a time, and then only 4 bytes at a time. Addressing determines which 4 bytes of which bank are being filled. For each 512-bit word to be stored in the cache, 4 bytes are stored in each bank. A shaded portion of each bank is illustrated, indicating corresponding portions of each bank for loading of a given word. These shaded portions are for illustration only; any given word can be loaded into any set of corresponding portions of the banks.

[0100] After swizzling according to the algorithm indicated above, sequential 4-byte portions of the swizzled word are loaded into the banks in the following order: 608, 616, 606, 614, 604, 612, 602, 610, 607, 615, 605, 613, 603, 611, 601, 609. The order of loading of the 4-byte sections of the swizzled word is indicated by roman numerals in the boxes representing the banks.
[0101] Fig. 6b shows how the swizzled word is read out from the cache. Fig. 6b shows only the shaded portions of the banks of the low stack; the high portion is analogous. Each shaded portion 601a-608a has 32 bits. The bits are loaded onto the output bus, called bus256low, using the connections shown, i.e. in the following order: 608a - bit 0, 607a - bit 0, ..., 601a - bit 0; 608a - bit 1, 607a - bit 1, ..., 601a - bit 1; ...; 608a - bit 31, 607a - bit 31, ..., 601a - bit 31. Using these connections, the word is automatically de-swizzled back to its proper bit order.

[0102] The bundles of wires 620, 621, ..., 622 together form the output bus256low. These wires pass through the cache to the output without crossing.

[0103] On output, the cache looks like Fig. 7. The bits are read out from stack low 701 and stack high 702 under control of control unit 704 through a shift network 703, which assures that the bits are in the output order specified above. In this way the entire 512-bit word is output without bundles 620, 621, ..., 622 and analogous wires crossing.

[0104] In the preceding, a VLIW processor has been described that uses compressed instructions. The VLIW processor has an instruction issue register comprising a plurality of issue slots, each issue slot being for storing a respective operation, all of the operations starting execution in a same clock cycle. The VLIW processor has a plurality of functional units for executing the operations stored in the instruction register. The VLIW processor has a decompression unit for providing decompressed instructions to the instruction issue register, the decompression unit taking compressed instructions from a compressed instruction storage medium and decompressing the compressed instructions. At least one of the compressed instructions includes at least one operation, each operation being compressed according to a compression scheme which assigns a compressed operation length to that operation. The compressed operation length is chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths; which of the finite lengths is chosen depends upon at least one feature of the operation.

[0105] Preferably, the set of operation lengths is {0, 26, 34, 42}. Also preferably, the at least one feature is at least one of the following:

- abbreviated op code;
- guarded or unguarded;
- resultless;
- immediate parameter with fixed number of bits; and
- zeroary, unary, or binary.

The fixed number is preferably one of 7 and 32. Preferably the processor comprises a plurality of such instructions, one of which is a branch target that is not compressed. Preferably, each operation field within each instruction includes a sub-field specifying at least one of the following: a register file address of a first operand; a register file address of a second operand; a register file address of guard information; a register file address of a result; an immediate parameter; and an op code. Preferably, each instruction comprises a format field for specifying a plurality of respective formats, one respective format for each operation of a succeeding instruction. Preferably the compressed format comprises a format field specifying the issue slots of the VLIW processor to be used by some instruction. Preferably at least one field specifies the operation, and the field specifying the operation comprises at least one byte-aligned sub-field. Preferably at least one operation part sub-field is located in a same byte with the format field. Thus instructions may be aligned with a byte boundary, not just with word boundaries. Preferably the format field may specify that more than a threshold quantity of issue slots are to be used, and further comprises at least one first operation part sub-field located in a same byte with the format field, a plurality of sub-fields specifying operations, and at least one second operation part sub-field located in a byte separate from the other sub-fields.

APPENDIX A

// Verilog HDL for Icache.ic_decompression, behavioral

`define FIXED_FORMAT 10'b1010101010
`define NOP_OPERATION 42'b0
// `ONE is used below but is not defined in this listing; 1'b1 is assumed.
`define ONE 1'b1

module ic_decompression (data512, pc,
                         operation4, operation3, operation2, operation1,
                         operation0, format_out, first_word,
                         format_ctrl0,
                         reissue1, stall_in, freeze, reset, clk);

  input  [511:0] data512;
  input  [31:0]  pc;

  output [43:0]  operation0, operation1, operation2, operation3, operation4;
  reg    [43:0]  operation0, operation1, operation2, operation3, operation4;

  output [31:0]  first_word;
  reg    [31:0]  first_word;

  input          format_ctrl0;
  input          reissue1, freeze, reset, clk;

  wire   [9:0]   format_out1;
  output [9:0]   format_out;
  reg    [9:0]   format_outA, format_p;

  input          stall_in;

  // local
  reg    [9:0]   format_out0;
  reg    [31:0]  pc_p;
  reg            format_ctrl;
  reg    [9:0]   format;
  reg            used0, used1, used2, used3, used4;
  reg    [1:0]   size0, size1, size2, size3, size4;
  reg    [511:0] data512shift;
  reg    [255:0] data256;
  reg    [2:0]   pos1, pos2, pos3, pos4, pos_ext;
  reg    [25:0]  fix0, fix1, fix2, fix3, fix4;
  reg    [79:0]  extension0, extension1, extension2, extension3, extension4;
  reg    [15:0]  ext0, ext1, ext2, ext3, ext4;
  reg            reset_p;

  // format pipe
  always @(posedge clk)
  begin
    reset_p <= reset;

    if (reset_p)
    begin
      // force NOP operations on instructions (format code 2'b11 = unused slot)
      operation0[42] <= `ONE;
      operation0[43] <= `ONE;
      operation1[42] <= `ONE;
      operation1[43] <= `ONE;
      operation2[42] <= `ONE;
      operation2[43] <= `ONE;
      operation3[42] <= `ONE;
      operation3[43] <= `ONE;
      operation4[42] <= `ONE;
      operation4[43] <= `ONE;
    end
    else if (~stall_in)
    begin
      pc_p <= pc;
      format_ctrl <= format_ctrl0;

      // branch targets use the fixed format; otherwise the format field
      // delivered with the preceding instruction is used
      if (format_ctrl)
        format = `FIXED_FORMAT;
      else
        format = format_out;

      used0 = ~(format[1] & format[0]);
      size0 = used0 ? format[1:0] : 2'b0;

      used1 = ~(format[3] & format[2]);
      size1 = used1 ? format[3:2] : 2'b0;

      used2 = ~(format[5] & format[4]);
      size2 = used2 ? format[5:4] : 2'b0;

      used3 = ~(format[7] & format[6]);
      size3 = used3 ? format[7:6] : 2'b0;

      used4 = ~(format[9] & format[8]);
      size4 = used4 ? format[9:8] : 2'b0;

      // first alignment stage
      // rotate the 512 bit word right over a distance between 0 and 64 byte
      //
      // the rotate is implemented here by swapping the left and right word if the
      // distance is 32 byte or more and then performing a right shift over a
      // distance between 0 and 32 byte.
      data512shift = pc_p[5] ?
                     {data512[255:0], data512[511:256]} >> {pc_p[4:0], 3'b0} :
                     data512 >> {pc_p[4:0], 3'b0};

      data256 = data512shift[255:0];

      // extract format bits
      format_out0 = data256[9:0];

      // access first word
      first_word <= data256[31:0];

      // Notes: - the value for pos_ext==0 is don't care
      //        - for values pos_ext < 5, less than 80 bits are needed

      // determine the position of issue slots
      // pos0 = 0;
      pos1 = used0;
      pos2 = used0 + used1;
      pos3 = used0 + used1 + used2;
      pos4 = used0 + used1 + used2 + used3;

      // mux the fixed part of issue slots, combine the 24-bit part and the 2-bit part
      fix0 = used0 ? {data256[15:14], data256[(0+1)*24+15 : 0*24+16]} : `NOP_OPERATION;

      fix1 = used1 ? (
               pos1 == 0 ? {data256[15:14], data256[(0+1)*24+15 : 0*24+16]} :
                           {data256[13:12], data256[(1+1)*24+15 : 1*24+16]}
             ) : `NOP_OPERATION;

      fix2 = used2 ? (
               pos2 == 0 ? {data256[15:14], data256[(0+1)*24+15 : 0*24+16]} :
               pos2 == 1 ? {data256[13:12], data256[(1+1)*24+15 : 1*24+16]} :
                           {data256[11:10], data256[(2+1)*24+15 : 2*24+16]}
             ) : `NOP_OPERATION;

      fix3 = used3 ? (
               pos3 == 0 ? {data256[15:14], data256[(0+1)*24+15 : 0*24+16]} :
               pos3 == 1 ? {data256[13:12], data256[(1+1)*24+15 : 1*24+16]} :
               pos3 == 2 ? {data256[11:10], data256[(2+1)*24+15 : 2*24+16]} :
                           {data256[95:94], data256[(0+1)*24+95 : 0*24+96]}
             ) : `NOP_OPERATION;

      fix4 = used4 ? (
               pos4 == 0 ? {data256[15:14], data256[(0+1)*24+15 : 0*24+16]} :
               pos4 == 1 ? {data256[13:12], data256[(1+1)*24+15 : 1*24+16]} :
               pos4 == 2 ? {data256[11:10], data256[(2+1)*24+15 : 2*24+16]} :
               pos4 == 3 ? {data256[95:94], data256[(0+1)*24+95 : 0*24+96]} :
                           {data256[93:92], data256[(1+1)*24+95 : 1*24+96]}
             ) : `NOP_OPERATION;

      // determine the position of the extension part
      pos_ext = used0 + used1 + used2 + used3 + used4;

      // determine the extension
      extension0 = pos_ext == 0 ? data256[0*24+80-1+16 : 0*24+16] :
                   pos_ext == 1 ? data256[1*24+80-1+16 : 1*24+16] :
                   pos_ext == 2 ? data256[2*24+80-1+16 : 2*24+16] :
                   pos_ext == 3 ? data256[3*24+80-1+16 : 3*24+16] :
                   pos_ext == 4 ? data256[1*24+80-1+96 : 1*24+96] :
                                  data256[2*24+80-1+96 : 2*24+96];

      // shift the extension part
      extension1 = extension0 >> {size0, 3'b0};
      extension2 = extension1 >> {size1, 3'b0};
      extension3 = extension2 >> {size2, 3'b0};
      extension4 = extension3 >> {size3, 3'b0};

      ext0 = extension0[15:0];
      ext1 = extension1[15:0];
      ext2 = extension2[15:0];
      ext3 = extension3[15:0];
      ext4 = extension4[15:0];

      // assemble instruction
      //operation0 <= {format_out0[1:0], ext0, fix0};
      //operation1 <= {format_out0[3:2], ext1, fix1};
      //operation2 <= {format_out0[5:4], ext2, fix2};
      //operation3 <= {format_out0[7:6], ext3, fix3};
      //operation4 <= {format_out0[9:8], ext4, fix4};
      operation0 <= {format[1:0], ext0, fix0};
      operation1 <= {format[3:2], ext1, fix1};
      operation2 <= {format[5:4], ext2, fix2};
      operation3 <= {format[7:6], ext3, fix3};
      operation4 <= {format[9:8], ext4, fix4};

      if (~freeze | reissue1)
      begin
        format_outA <= format_out0;
      end

      if (~freeze)
      begin
        format_p <= format_outA;
      end
    end
  end

  assign format_out = reissue1 ? format_p : format_outA;
  assign format_out1 = format;

endmodule
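The slot-position and extension logic of the Appendix A listing can be modelled in a few lines of Python, which may help when checking the listing. This is a behavioural sketch of one reading of the (partly garbled) listing, not a reference model: it ignores the pipeline registers, reset, stall and reissue handling, and takes the format field that governs the instruction explicitly as `fmt` instead of from the preceding instruction. The table names FIXED_PART and EXT_START are hypothetical and were read off the fix0..fix4 and extension0 multiplexers.

```python
# (lsb of the 2-bit part, lsb of the 24-bit part) of the fixed part for the
# p-th *used* issue slot, read off the fix0..fix4 multiplexers.
FIXED_PART = [(14, 16), (12, 40), (10, 64), (94, 96), (92, 120)]
# Start of the 80-bit extension area, indexed by pos_ext (number of used slots).
EXT_START = [16, 40, 64, 88, 120, 144]


def decompress(data256: int, fmt: int) -> list:
    """Return the five 44-bit outputs {format[1:0], ext[15:0], fix[25:0]}."""
    codes = [(fmt >> (2 * s)) & 0b11 for s in range(5)]
    used = [code != 0b11 for code in codes]

    fixes, pos = [], 0
    for s in range(5):
        if used[s]:
            lsb2, lsb24 = FIXED_PART[pos]
            two = (data256 >> lsb2) & 0b11
            low = (data256 >> lsb24) & 0xFFFFFF
            fixes.append((two << 24) | low)
            pos += 1
        else:
            fixes.append(0)  # NOP_OPERATION
    extension = (data256 >> EXT_START[pos]) & ((1 << 80) - 1)

    ops = []
    for s in range(5):
        ext = extension & 0xFFFF
        size = codes[s] if used[s] else 0  # 0, 1 or 2 extension bytes
        extension >>= 8 * size
        ops.append((codes[s] << 42) | (ext << 26) | fixes[s])
    return ops


# Slot 0 holds a 42-bit operation, slot 2 a 26-bit one; slots 1, 3, 4 unused.
fmt = 0b11_11_00_11_10
window = ((0b10 << 14) | (0x123456 << 16)     # slot 0, fixed part
          | (0b01 << 12) | (0x654321 << 40)   # slot 2, fixed part
          | (0xBEEF << 64))                   # slot 0, 16-bit extension
ops = decompress(window, fmt)
print([hex(op) for op in ops])
```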



Claims 

1. A VLIW processor for using compressed instructions, the processor comprising:

an instruction issue register (154) comprising a plurality of issue slots, each issue slot being for storing a respective operation, all of the operations starting execution in a same clock cycle;

a plurality of functional units (151, 152, 153) for executing the operations stored in the instruction issue register (154);

a decompression unit (155) for providing a decompressed instruction to the instruction issue register (154), the decompression unit (155) taking a compressed instruction from a compressed instruction storage medium (103) and decompressing the compressed instruction, the compressed instruction including at least two operations, each of said at least two operations being compressed to a respective compressed operation length,

characterized in that the decompression unit (155) is arranged to decompress operations with respective compressed operation lengths chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths.

2. The processor of claim 1, the decompression unit (155) being arranged to take a format field from the compressed instruction storage medium (103), the format field specifying the respective compressed operation length for each operation of the compressed instruction, the decompression unit (155) decompressing the operations of the compressed instruction according to the format field.

3. The processor of claim 2 wherein the format field has N sub-fields, N being the number of issue slots, each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

4. The processor of Claim 2 or 3, the decompression unit (155) being arranged for

taking a preceding compressed instruction from the compressed instruction storage medium (103) together with the format field,

starting decompression of the preceding compressed instruction, and subsequently

taking the compressed instruction from the compressed instruction memory and starting decompression of the compressed instruction according to the format field taken from the compressed instruction medium (103) together with the preceding compressed instruction.

5. The processor of claim 2, 3 or 4, the decompression unit (155) taking the format field from the compressed instruction storage medium (103) in a memory access unit, the memory access unit also comprising at least one operation part sub-field, the decompression unit (155) integrating the operation part sub-field in at least one of the operations of the decompressed instruction.
6. The VLIW processor of claim 1, wherein: the decompression unit (155) is arranged for taking a stream of compressed instructions from the compressed instruction storage medium (103) and decompressing the compressed instructions, the stream of instructions comprising: a first instruction including a format field which specifies an instruction compression format, and wherein the stream of instructions comprises a second instruction, taken from the compressed instruction storage medium (103) following the first instruction, the decompression unit (155) being arranged to decompress the second instruction according to the format field in the first instruction.

7. A method of producing compressed code for running on a VLIW processor, the method comprising the steps of

- receiving an instruction comprising at least two operations;
- compressing each of said at least two operations of the instruction according to a respective compression scheme which assigns a respective compressed operation length to the relevant operation,

characterized in that the compressed operation length is chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths, which of the finite lengths is chosen depending upon at least one feature of the operation.

8. The method of claim 7 applied to a stream of instructions including said instruction, the method comprising the step of determining for each instruction whether that instruction is a branch target (i2) of a branch from another instruction of the stream of instructions, and compressing only those instructions which are not branch targets.

9. The method of Claim 7 or 8, comprising the step of producing a format field, the format field specifying a respective 
format for each operation of the instruction according to the compressed operation length chosen for that operation. 

10. The method of claim 9 wherein the format field has N sub-fields, N being the number of issue slots, each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

11. The method of claim 7, 8, 9 or 10, comprising the step of storing a compressed instruction containing the compressed operations in a computer-readable compressed instruction storage medium (103).

12. The method of claim 11, when dependent on claim 9, comprising compressing a further instruction, for execution preceding execution of the instruction, the format field for the instruction being stored for fetching with the further instruction.

13. The method of claim 11, when dependent on claim 9, or claim 12, the compressed instruction storage medium (103) having memory access units, the format field being stored in a same memory access unit with at least one operation part sub-field of at least one of the operations of the instruction.

14. The method of claim 11, when dependent on claim 9, wherein the receiving step comprises

receiving a stream of instructions each comprising a plurality of operations, wherein the stream comprises a first instruction for execution preceding a second instruction from the stream, the format field corresponding to the second instruction and the compressed instruction corresponding to the first instruction being stored in the compressed instruction storage medium (103) for combined fetching during execution of the stream, prior to fetching of the compressed instruction corresponding to the second instruction.

15. The method of claim 14, comprising the step of determining for each instruction whether that instruction is a branch target (i1) of a branch from another instruction of the stream of instructions, and compressing only those instructions which are not branch targets.

16. A computer programmed to execute the method according to any one of the claims 7 to 15. 
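As an illustration only, the compression method of claims 7 to 10 and the matching decompression of claims 1 to 3 can be sketched in Python. This is a minimal sketch under assumptions that do not come from the claims: four issue slots, a 2-bit format sub-field per slot selecting one of the hypothetical lengths 0, 26, 34 or 42 bits, a no-op encoded as the integer 0 and eliminated entirely (length 0), and the operation's bit length used as the "feature" that selects its compressed length. Branch targets (claim 8) are stored at the full width. For simplicity the format field is placed directly in front of its own instruction, rather than fetched together with the preceding instruction as in claims 4, 12 and 14.

```python
from typing import List, Sequence, Set

# Hypothetical parameters chosen for illustration; the claims fix none of them.
N_SLOTS = 4                   # N, the number of issue slots
LENGTHS = (0, 26, 34, 42)     # finite compressed lengths in bits, indexed by 2-bit code
FULL_CODE = len(LENGTHS) - 1  # code for the full, uncompressed width (42 bits)
NOP = 0                       # assumed encoding of a no-op


def _bits(value: int, width: int) -> str:
    """Render value as a zero-padded bit string of exactly `width` bits."""
    return format(value, f"0{width}b") if width else ""


def choose_code(op: int, compress: bool) -> int:
    """Pick the 2-bit format code for one operation.

    The feature of the operation used here is its bit length: the shortest
    listed length that can hold it is chosen, and a no-op gets length 0.
    """
    if op < 0 or op.bit_length() > LENGTHS[FULL_CODE]:
        raise ValueError("operation does not fit the uncompressed width")
    if not compress:
        return FULL_CODE  # branch targets stay uncompressed
    if op == NOP:
        return 0
    return next(c for c in range(1, len(LENGTHS)) if op.bit_length() <= LENGTHS[c])


def compress_instruction(ops: Sequence[int], compress: bool) -> str:
    """Format field (one 2-bit sub-field per slot) followed by the packed operations."""
    if len(ops) != N_SLOTS:
        raise ValueError("expected one operation (or NOP) per issue slot")
    codes = [choose_code(op, compress) for op in ops]
    fmt = "".join(_bits(code, 2) for code in codes)
    body = "".join(_bits(op, LENGTHS[code]) for op, code in zip(ops, codes))
    return fmt + body


def compress_stream(program: Sequence[Sequence[int]], branch_targets: Set[int]) -> str:
    """Compress every instruction except those that are branch targets."""
    return "".join(
        compress_instruction(ops, compress=i not in branch_targets)
        for i, ops in enumerate(program)
    )


def decompress_stream(stream: str) -> List[List[int]]:
    """Read each format field, then unpack each slot at the length it names."""
    program: List[List[int]] = []
    pos = 0
    while pos < len(stream):
        codes = [int(stream[pos + 2 * s : pos + 2 * s + 2], 2) for s in range(N_SLOTS)]
        pos += 2 * N_SLOTS
        ops = []
        for code in codes:
            width = LENGTHS[code]
            ops.append(int(stream[pos : pos + width], 2) if width else NOP)
            pos += width
        program.append(ops)
    return program


if __name__ == "__main__":
    program = [
        [5, NOP, 1 << 30, NOP],
        [NOP, NOP, NOP, (1 << 40) + 3],
        [7, 8, 9, 10],  # instruction 2 is treated as a branch target
    ]
    stream = compress_stream(program, branch_targets={2})
    assert decompress_stream(stream) == program
    print(len(stream))  # 294 bits
```

With instruction 2 marked as a branch target, the three instructions occupy 294 bits, against 528 bits if every slot were stored at 42 bits behind the same 8-bit format fields. The 2-bit sub-fields are what allow more than one non-zero length per slot; a 1-bit mask could only distinguish "present" from "eliminated".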

Claims (French version)

1. A VLIW processor for using compressed instructions, the processor comprising:

an instruction issue register (154) comprising a plurality of issue slots, each issue slot serving to store a respective operation, all the operations beginning execution in a same clock cycle;

a plurality of functional units (151, 152, 153) for executing the operations stored in the instruction issue register (154);

a decompression unit (155) for supplying a decompressed instruction to the instruction issue register (154), the decompression unit (155) fetching a compressed instruction from a compressed instruction storage medium (103) and decompressing the compressed instruction, the compressed instruction comprising at least two operations, each of said at least two operations being compressed to a respective compressed operation length;

characterized in that the decompression unit (155) is able to decompress operations with respective compressed operation lengths chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths.

2. The processor according to claim 1, the decompression unit (155) being able to fetch a format field from the compressed instruction storage medium (103), the format field specifying the respective compressed operation length for each operation of the compressed instruction, the decompression unit (155) decompressing the operations of the compressed instruction according to the format field.

3. The processor according to claim 2, wherein the format field comprises N sub-fields, N being the number of issue slots, each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

4. The processor according to claim 2 or 3, the decompression unit (155) being able to:

fetch a preceding compressed instruction from the compressed instruction storage medium (103) together with the format field;

begin decompressing the preceding compressed instruction, and then

fetch the compressed instruction from the compressed instruction memory and begin decompressing the compressed instruction according to the format field fetched from the compressed instruction storage medium (103) together with the preceding compressed instruction.

5. The processor according to claim 2, 3 or 4, the decompression unit (155) fetching the format field from the compressed instruction storage medium (103) in a memory access unit, the memory access unit also comprising at least one operation part sub-field, the decompression unit (155) incorporating the operation part sub-field into at least one of the operations of the decompressed instruction.

6. The VLIW processor according to claim 1, wherein

the decompression unit (155) is able to fetch a stream of compressed instructions from the compressed instruction storage medium (103) and to decompress the compressed instructions, the stream of instructions comprising: a first instruction including a format field which specifies an instruction compression format, and wherein

the stream of instructions comprises a second instruction, fetched from the compressed instruction storage medium (103) following the first instruction, the decompression unit (155) being able to decompress the second instruction according to the format field in the first instruction.

7. A method of producing compressed code to be executed on a VLIW processor, the method comprising the following steps:

receiving an instruction comprising at least two operations;

compressing each of said at least two operations of the instruction according to a respective compression scheme which assigns a respective compressed operation length to the operation concerned,

characterized in that the compressed operation length is chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths, the one of the finite lengths that is chosen depending on at least one feature of the operation.
8. The method according to claim 7 applied to a stream of instructions including said instruction, the method comprising the step of determining for each instruction whether that instruction is a branch target (i1) of a branch from another instruction of the stream of instructions, and compressing only those instructions which are not branch targets.

9. The method according to claim 7 or 8, comprising the step of producing a format field, the format field specifying a respective format for each operation of the instruction according to the compressed operation length chosen for that operation.

10. The method according to claim 9, wherein the format field comprises N sub-fields, N being the number of issue slots, each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

11. The method according to claim 7, 8, 9 or 10, comprising the step of storing a compressed instruction containing the compressed operations in a computer-readable compressed instruction storage medium (103).

12. The method according to claim 11, when dependent on claim 9, comprising compressing a further instruction, for execution before execution of the instruction, the format field for the instruction being stored for fetching with the further instruction.

13. The method according to claim 11, when dependent on claim 9, or claim 12, the compressed instruction storage medium (103) comprising memory access units, the format field being stored in a same memory access unit with at least one operation part sub-field of at least one of the operations of the instruction.

14. The method according to claim 11, when dependent on claim 9, wherein the receiving step comprises

receiving a stream of instructions each comprising a plurality of operations;

wherein the stream comprises a first instruction for execution before a second instruction of the stream, the format field corresponding to the second instruction and the compressed instruction corresponding to the first instruction being stored in the compressed instruction storage medium (103) for combined fetching during execution of the stream, before fetching of the compressed instruction corresponding to the second instruction.

15. The method according to claim 14, comprising the step of determining for each instruction whether that instruction is a branch target (i1) of a branch from another instruction of the stream of instructions, and compressing only those instructions which are not branch targets.

16. A computer programmed to execute the method according to any one of claims 7 to 15.



Claims (German version)

1. A VLIW processor for using compressed instructions, the processor comprising:

an instruction issue register (154) having a plurality of issue slots, each issue slot serving to store a respective operation, and all operations beginning execution in the same clock cycle;

a plurality of functional units (151, 152, 153) for executing the operations stored in the instruction issue register (154);

a decompression unit (155) for supplying the instruction issue register (154) with a decompressed instruction, the decompression unit (155) taking a compressed instruction from a compressed instruction storage medium (103) and decompressing the compressed instruction, the compressed instruction containing at least two operations, each of the at least two operations being compressed to a respective compressed operation length,

EP 0 843 848 B1

characterized in that the decompression unit (155) is arranged to decompress operations having respective compressed operation lengths each chosen from a plurality of finite lengths, the finite lengths including at least two lengths that are not zero.

2. The processor according to claim 1, wherein the decompression unit (155) is arranged to take a format field from the compressed instruction storage medium (103), the format field specifying the respective compressed operation length for each operation of the compressed instruction, and wherein the decompression unit (155) decompresses the operations of the compressed instruction according to the format field.

3. The processor according to claim 2, wherein the format field has N sub-fields, N being the number of issue slots and each sub-field specifying a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

4. The processor according to claim 2 or 3, wherein the decompression unit (155) is arranged to

take a preceding compressed instruction from the compressed instruction storage medium (103) together with the format field,

begin decompressing the preceding compressed instruction, and subsequently

take the compressed instruction from the compressed instruction memory and begin decompressing the compressed instruction according to the format field that was taken from the compressed instruction storage medium (103) together with the preceding compressed instruction.

5. The processor according to claim 2, 3 or 4, wherein the decompression unit (155) takes the format field from the compressed instruction storage medium (103) in a memory access unit, the memory access unit further comprising at least one operation part sub-field, and the decompression unit (155) incorporating the operation part sub-field into at least one of the operations of the decompressed instruction.

6. The VLIW processor according to claim 1, wherein the decompression unit (155) is arranged to take a stream of compressed instructions from the compressed instruction storage medium (103) and to decompress the compressed instructions, and wherein the instruction stream comprises a first instruction containing a format field which specifies an instruction compression format, the instruction stream comprising a second instruction which is taken from the compressed instruction storage medium (103) after the first instruction; and wherein the decompression unit (155) is arranged to decompress the second instruction according to the format field in the first instruction.

7. A method of producing compressed code that runs on a VLIW processor, the method comprising the following steps:

receiving an instruction having at least two operations;

compressing each of said at least two operations of the instruction according to a respective compression scheme which assigns a respective compressed operation length to the operation concerned,

characterized in that the compressed operation length is selected from a plurality of finite lengths, the finite lengths including at least two lengths that are not zero, and the selection of one of the finite lengths depending on at least one feature of the operation.

8. The method according to claim 7, applied to an instruction stream containing said instruction, the method comprising the step of determining for each instruction whether that instruction is the branch target (i1) of a branch from another instruction of the instruction stream, and compressing only those instructions which are not branch targets.

9. The method according to claim 7 or 8, comprising the step of producing a format field, the format field specifying the respective format for each operation of the instruction according to the compressed operation length chosen for that operation.

10. The method according to claim 9, wherein the format field has N sub-fields, N being the number of issue slots, and each sub-field specifies a compressed operation length for a respective issue slot, characterized in that the sub-fields each contain at least two bits.

11. The method according to claim 7, 8, 9 or 10, comprising the step of storing a compressed instruction containing the compressed operations in a computer-readable compressed instruction storage medium (103).

12. The method according to claim 11, when dependent on claim 9, comprising compressing a further instruction that is to be executed before execution of the instruction, the format field for the instruction being stored for fetching with the further instruction.

13. The method according to claim 11, when dependent on claim 9, or claim 12, wherein the compressed instruction storage medium (103) has memory access units, and the format field is stored in the same memory access unit as at least one operation part sub-field of at least one of the operations of the instruction.

14. The method according to claim 11, when dependent on claim 9, wherein the receiving step comprises:

receiving an instruction stream having a plurality of operations,

wherein the stream comprises a first instruction for execution before a second instruction from the stream, and the format field belonging to the second instruction and the compressed instruction belonging to the first instruction are stored in the compressed instruction storage medium (103) so as to be fetched together during execution of the stream, before the compressed instruction belonging to the second instruction is fetched.

15. The method according to claim 14, comprising the step of determining for each instruction whether that instruction is the branch target (i1) of a branch from another instruction of the instruction stream, and compressing only those instructions which are not branch targets.

16. A computer programmed to carry out the method according to any one of claims 7 to 15.
[Drawings: FIG. 1A; FIG. 1B (functional units 151, 152, 153; instruction issue register 154; decompression unit 155); FIGS. 2A to 2E; FIG. 4A; FIG. 7]